This project explores and analyzes red wine quality. The underlying objective is to understand which chemical properties influence the quality of red wines. The exploratory data analysis is carried out in R; the dataset can be found here and additional literature on the variables can be found here.
The following are some basic statistics on the dataset and the quality variable.
# Summary Statistics
str(wq)
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
summary(wq)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
summary(wq$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
The dataset holds 1,599 wine observations across 13 numeric variables. X appears to be a unique row identifier, and quality is the primary output. Quality is scored on a 10-point scale and was rated by at least three wine experts. Interestingly, the observed scores range only from 3 to 8, with a mean of 5.6 and a median of 6. Because the scores take only whole-number values, quality is best treated as an ordinal, discrete variable.
table(wq$quality)
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
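The counts above are enough to rebuild the quality column and re-check the summary figures. The following is a minimal sketch in base R that assumes only the counts printed by `table(wq$quality)`; the vector `quality` is an illustrative stand-in for `wq$quality`, not the original data frame.

```r
# Rebuild the quality scores from the counts printed by table(wq$quality).
# `quality` is an illustrative stand-in for wq$quality.
scores  <- 3:8
counts  <- c(10, 53, 681, 638, 199, 18)
quality <- rep(scores, times = counts)

length(quality)                  # number of wines
mean(quality)                    # average score
median(quality)                  # middle score
all(quality == round(quality))   # TRUE when every score is a whole number

# Storing quality as an ordered factor makes its ordinal nature explicit.
quality_ord <- factor(quality, levels = scores, ordered = TRUE)
table(quality_ord)
```

Keeping an ordered-factor copy alongside the integer column is one option when a plot or model should treat the scores as ranked categories rather than as a continuous measurement.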
The following histograms and boxplots for the 12 variables kick off the data visualizations.
Looking at the histograms for all the features, density and pH appear approximately normally distributed, as does quality. These relationships will be explored further in subsequent sections.
Most of the other plots are right-skewed, with the bulk of the observations on the left and a tail extending to the right. Citric acid appears to have a high number of zero values, which is concerning. Residual sugar and chlorides have particularly long tails. Let’s see how these trends compare on boxplots next.
These boxplots confirm many of the trends picked up in the histograms. The roughly normal distributions of density and pH can be observed here as well. Likewise, residual sugar and chlorides have a lot of outliers. The distribution of citric acid is fairly odd. Perhaps subsetting out the zero values might help.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
The histogram is slightly better, but the boxplot doesn’t seem to have changed much for citric acid. The zero values could reflect unreported or missing data.
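The subsetting step can be sketched as follows. This is only an illustration under stated assumptions: `citric_acid` holds the first ten values shown by `str(wq)` rather than the full column, and base R's `hist()` stands in for the ggplot2 histogram used in the report.

```r
# First ten citric.acid values as printed by str(wq); not the full column.
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

sum(is.na(citric_acid))   # entries that are truly missing (NA)
sum(citric_acid == 0)     # entries recorded as exactly zero

# Keep only the strictly positive readings before plotting.
citric_nonzero <- citric_acid[citric_acid > 0]

# Explicit breaks play the same role as setting `binwidth` in ggplot2,
# which is what the stat_bin message above asks for.
bins <- hist(citric_nonzero, breaks = seq(0, 1, by = 0.05), plot = FALSE)
bins$counts
```

Counting `NA` and zero entries separately shows whether the odd shape comes from missing data or from wines recorded with no citric acid at all.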
There are 1,599 wine observations across 13 numeric variables where X is the unique identifier and fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality are the 12 features.
The first 11 variables are physicochemical measurements on the wine samples, and quality is an output on a 10-point scale based on sensory data from at least three wine experts.
The main feature of interest is quality. From the Univariate Plots Section, it can be observed that quality follows a near-normal distribution where the bulk of the observations are in the 5-6 range with some outliers on either end. This can be further outlined by using a coarser rating variable, such that a quality score of 0-4 denotes a Poor wine, a score of 5-6 denotes an Average wine, and a score of 7+ denotes a Good wine.
## Poor Average Good
## 63 1319 217
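One way to build such a rating variable is with `cut()`. The sketch below is not necessarily how the original chunk was written; `quality` is again a stand-in vector rebuilt from the counts printed by `table(wq$quality)`.

```r
# Stand-in for wq$quality, rebuilt from the counts shown by table(wq$quality).
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Bucket the scores: 0-4 Poor, 5-6 Average, 7 and above Good.
# right = TRUE closes each interval on the right, e.g. (4, 6].
rating <- cut(quality,
              breaks = c(-Inf, 4, 6, Inf),
              labels = c("Poor", "Average", "Good"),
              right = TRUE,
              ordered_result = TRUE)

table(rating)
```

Using `ordered_result = TRUE` keeps the Poor < Average < Good ordering, so the buckets sort correctly on plot axes and in facets.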
Throughout this exploratory data analysis, the drivers of quality will be unearthed and examined.
Similar to quality, density and pH seem to be approximately normally distributed. Fixed and volatile acidity, free and total sulphur dioxide, sulphates, and alcohol seem to be skewed and long-tailed. It is not yet clear which features directly affect quality, but from some high-level research, it appears that alcohol content, acidity and pH might be contributors.
Further research failed to highlight any difference in benefit between the different types of acidity in wine. Thus, for the purpose of this project, fixed acidity (tartaric acid), volatile acidity (acetic acid) and citric acid were combined into a variable named acidity. It should also be noted that the presence of sulphur dioxide and sulphates may indicate the presence of sulphuric acid; this is ignored as being beyond the scope of this project.
A new variable, rating, was defined to categorize the wine quality scores into Poor, Average, and Good buckets and illustrate their near-normal distribution. Lastly, a key variable, acidity, was defined as the sum of fixed acidity, volatile acidity and citric acid. It is hypothesized that acidity is a driver of wine quality.
The distribution of citric acid is fairly unusual given that the distributions of fixed acidity and volatile acidity on a logarithmic scale conform to the normal distribution of pH. Citric acid has a large number of zero values, which could reflect incomplete or unavailable data.
The dataset in general was fairly tidy such that additional wrangling was not needed.
The bivariate plots began with a scatterplot matrix. Unfortunately, due to the large file size, generating such a plot took much too long. Instead, a sample of the dataset was used to begin the exploration. Still, the plot was just too messy to be of much use.
The scatterplot matrix knitr chunk was almost silenced as the gigantic plot was too unwieldy to draw meaningful insights from. Nevertheless, the boxplots on rating and some of the correlations seem noteworthy. They were subsequently explored.
These boxplots provided some very interesting insights. It appears that fixed acidity, citric acid, sulphates and alcohol are positively associated with wine quality, while volatile acidity and pH are negatively associated. The difference in behavior of the acids does call into question the decision to combine them into a single acidity variable, but a better assessment will be made in a subsequent section.
Lastly, it seems that density doesn’t play a significant part in wine quality. Because of its normal distribution in the univariate section, it was a feature of interest. Perhaps the correlation values will be kinder?
## X fixed.acidity volatile.acidity
## 0.06645261 0.12405165 -0.39055778
## citric.acid residual.sugar chlorides
## 0.22637251 0.01373164 -0.12890656
## free.sulfur.dioxide total.sulfur.dioxide density
## -0.05065606 -0.18510029 -0.17491923
## pH sulphates alcohol
## -0.05773139 0.25139708 0.47616632
## quality rating acidity
## 1.00000000 0.81236704 0.10375373
## X fixed.acidity volatile.acidity
## 0.11527163 0.11423756 -0.39124918
## citric.acid residual.sugar chlorides
## NaN 0.02353331 -0.17613996
## free.sulfur.dioxide total.sulfur.dioxide density
## -0.05008749 -0.17014272 -0.17517368
## pH sulphates alcohol
## -0.05757386 0.30864193 0.47698109
## quality rating acidity
## 0.97556915 0.79200148 0.09282597
Correlation tests were performed on a plain and a logarithmic scale; the first block of output above is on the plain scale and the second on a log10 scale. The NaN for citric acid in the second block arises because the log10 of its zero values is not finite. As expected, citric acid, alcohol and, to a lesser extent, fixed acidity had a positive correlation with quality, while volatile acidity had a negative correlation. Interestingly, sulphates appeared to have a stronger correlation on a logarithmic scale, and pH seemed to be hardly correlated.
A couple more interesting insights were:
- acidity has an extremely low correlation with quality, at 0.104. This proved to be somewhat of a dead end, unfortunately.
- density has a modest correlation of -0.175. This isn’t the best, but enough to still be of interest.
From the boxplots, it appears that fixed acidity, citric acid, sulphates and alcohol are positively associated with wine quality, while volatile acidity and pH are negatively associated. From the correlation tests, similar trends were observed, with the exception of pH showing a correlation of only about -0.058 and sulphates having a stronger correlation of 0.309 on the log10 scale.
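The plain versus log10 comparison can be sketched as below. This is an illustration rather than the original chunk: `wq_head` is a hypothetical data frame holding only the first ten rows printed by `str(wq)`, so its coefficients will not match the full-data values above.

```r
# First ten rows of three columns, copied from the str(wq) output.
wq_head <- data.frame(
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates   = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  quality     = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Correlate each feature with quality, on the plain and the log10 scale.
features  <- c("citric.acid", "sulphates")
cor_plain <- sapply(wq_head[features], cor, y = wq_head$quality)
cor_log10 <- sapply(wq_head[features],
                    function(x) cor(log10(x), wq_head$quality))

# log10(0) is -Inf, so a column containing zeros has no usable log10
# correlation unless the zeros are dropped or shifted first.
log10(0)
cor_plain
cor_log10
```

If a log-scale coefficient for citric acid is wanted, the zeros have to be handled explicitly, for example by restricting the calculation to wines with a positive citric acid reading.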
The acidity and sulphur dioxide relationships were examined.
There seems to be a trend between fixed acidity and citric acid, and between volatile acidity and citric acid, but oddly there seems to be no relationship between fixed acidity and volatile acidity. This could be because the underlying chemistry of the two is independent.
As a purely positive control test, the relationship between acidity on a logarithmic scale and pH was observed.
## cor
## -0.7044435
As expected, the higher the acidity, the lower the pH value, with a correlation coefficient of -0.704.
The relationship between free and total sulphur dioxide was investigated.
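A minimal sketch of this control test follows. It assumes that acidity was log10-transformed before being correlated with pH, and it uses a hypothetical `wq_head` data frame built from the first ten rows printed by `str(wq)`, so the coefficient will differ from the full-data value.

```r
# First ten rows of the acid columns and pH, copied from the str(wq) output.
wq_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)
)

# Combined acidity as the plain sum of the three acid measurements.
wq_head$acidity <- with(wq_head, fixed.acidity + volatile.acidity + citric.acid)

# pH is already a log-scale quantity, so acidity is log-transformed to match.
check <- cor.test(log10(wq_head$acidity), wq_head$pH)
check$estimate   # Pearson's r, labelled "cor" as in the output above
```

`cor.test()` also returns a p-value and confidence interval in `check$p.value` and `check$conf.int`, which may be useful when judging how much weight a coefficient deserves.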
## cor
## 0.6676665
A correlation coefficient of 0.668 indicates a fairly strong relationship between the two sulphur dioxide measures. Some research indicates that sulphur dioxide is an antimicrobial in wine making and that free sulphur dioxide is one part of the total.
The strongest relationships to quality were as follows:
- alcohol: 0.476
- sulphates (log10): 0.309
- citric acid: 0.226
- fixed acidity: 0.124
- volatile acidity: -0.391
- density: -0.175
There were six features of interest from the bivariate plots. In this multivariate plot section, they were explored in further detail.
This is a really interesting plot. It appears that good wines tend to have both higher alcohol and higher sulphates.
Even with the zero values removed, it is hard to pick out a decent trend.
These two plots were examined as it was believed that alcohol and volatile acidity would have an interesting interplay, given that their correlations with quality have opposite signs. The second plot proved to be very telling; it showed a clear separation between poor wine (high volatile acidity and low alcohol content) and good wine (low volatile acidity and high alcohol content).
Density didn’t appear to yield much in terms of trend with alcohol.
Citric acid didn’t yield additional insights in visual trends with fixed acidity and volatile acidity.
Density proved to be a dead end. Due to their negative correlation with wine quality, it was expected that density and volatile acidity were correlated in some way. As seen in the plot, it was not so.
Surprisingly, pH had very little visual impact on wine quality, and was overshadowed by the larger impact of alcohol.
For the multivariate plots, the features that bore the strongest relationship to quality were observed by coloring the plots by quality score and faceting them by the three rating categories. It can be noted that higher alcohol, sulphates, citric acid, and fixed acidity, and lower volatile acidity, are associated with better wine quality. This is in line with the insights uncovered thus far.
Since alcohol, specifically ethanol, is a weak acid, it was thought to be somewhat correlated with the presence of other acids, such as citric acid. The plot of alcohol against citric acid above clearly shows their lack of correlation with each other.
To close off the discussion around pH, it can be visually observed not to be a driver of wine quality when compared with the very obvious alcohol variable. It should be noted, though, that pH depends on the concentration of acids in wine, and as such doesn’t seem to vary far from the 3-4 range.
From the numerous plots above, it can be found that acidity, alcohol content and sulphates contribute to good wines. The final plots will illustrate these findings.
It can be noted that not all acids are created equal. These boxplots illustrate that higher fixed acidity (tartaric acid) and citric acid are found in better quality wines. Furthermore, lower volatile acidity (acetic acid) is also associated with higher wine quality. Therefore, a lower pH alone would be a red herring for wine quality. After all, a higher acid concentration leads to a lower pH value, but only tartaric and citric acid seem to benefit wine quality.
This scatterplot shows a trend of higher wine quality ratings with higher alcohol content and lower volatile acidity. The correlation tests indicated that alcohol and volatile acidity were the two features most strongly correlated with quality. The dotted lines represent the mean of each respective axis, whereby the bottom right quadrant has a high density of Good wine ratings.
This final plot is perhaps one of the most telling visualizations, as it illustrates that good wines have an abundance of sulphates and alcohol at the same time. The dotted lines represent the mean of each respective axis, whereby the top right quadrant has a high density of Good wine ratings.
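The quadrant reading of such a plot can also be checked numerically. The sketch below is only an illustration: `wq_head` is a hypothetical data frame holding the first ten rows printed by `str(wq)`, which is far too few to reproduce the pattern seen in the full plot, and `target_quadrant` is a name introduced here.

```r
# First ten rows of three columns, copied from the str(wq) output.
wq_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# The dotted reference lines sit at the mean of each axis.
mean_alcohol <- mean(wq_head$alcohol)
mean_va      <- mean(wq_head$volatile.acidity)

# Flag wines with above-average alcohol and below-average volatile acidity.
wq_head$target_quadrant <- wq_head$alcohol > mean_alcohol &
  wq_head$volatile.acidity < mean_va

# Share of each quality score that falls in that quadrant.
tapply(wq_head$target_quadrant, wq_head$quality, mean)
```

Run on the full dataset with the rating variable in place of the raw score, the same calculation would give the proportion of Poor, Average and Good wines in the quadrant, which puts a number on what the scatterplot shows visually.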
Exploratory data analysis proved to be very effective in understanding relationships within the red wine quality dataset. At the beginning of the analysis, various features were considered of interest, namely density, pH, fixed acidity, volatile acidity, sulphates, and alcohol. The univariate plots were helpful in getting accustomed to the distribution of the features, but it was ultimately the bivariate plots that yielded key insights into where to look more closely. The multivariate plots revealed key trends that were extremely telling; they added a layer of detail over the bivariate plots that was very helpful, and were thus favoured.
There were a few slight struggles and dead ends throughout this project. The scatterplot matrix using ggplot was very cumbersome to plot. This was very likely due to the dozen-plus variables being plotted at once, and as such it was ineffective in illustrating trends and correlations. Instead, dedicated plots and correlation coefficients were generated against the quality feature. Beyond plotting difficulties, pH, density and the combined acidity variable proved to be dead ends. They were explored at length and with much promise, but were ultimately fruitless in displaying any meaningful relationship.
It was found that fixed acidity, citric acid, alcohol content and sulphates are positively associated with wine quality, and volatile acidity is negatively associated with it. Boxplots and scatterplots seemed to be the most telling visualizations for this dataset. The final plots depict the relationship of acidity to a good wine and, most importantly, show that such a wine will likely come with high alcohol content, high sulphates and low volatile acidity. The final plots also debunked the notion that pH in general is correlated with wine quality.
It should be noted that wine quality is highly subjective and depends on an individual’s taste; a better study would include the quantities of each wine sold in the market. Further analysis using inferential statistics and similar methodologies should be used to verify the findings in this exploration. Nevertheless, the plots here did uncover an interesting and telling story of wine quality in the available observations.